[wip] one less allocation/copy in Pipe #57
base: main
Conversation
I'm open to changing the transport buffer interface, but the burden of this PR is to:
- ensure resharding tests don't break (and that we don't drop contiguous-tensor support)
- prove this is better
- prove that this is faster

Arguably the interface is too biased towards RDMA buffers, but any new interface needs to be generically supported across all backends (incl. gloo, in flight).
If your goal is to make the RDMA buffer faster, can you use test_models.py and test whether this is faster?
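The "prove it's faster" ask can be sketched with a generic timing harness. This is NOT the repo's test_models.py; `copy_path` and `return_path` are hypothetical stand-ins for the old (caller-allocated, copy-in) and new (callee-returned) call shapes:

```python
import time

def bench(fn, *args, iters=5):
    """Average wall time of fn(*args) over `iters` runs."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def copy_path(data):
    # old shape: caller pre-allocates, callee copies into it
    buf = bytearray(len(data))
    buf[:] = data
    return buf

def return_path(data):
    # new shape: callee hands back its buffer, skipping one copy
    return data

data = bytes(8 * 1024 * 1024)  # 8 MiB payload
t_copy = bench(copy_path, data)
t_ref = bench(return_path, data)
print(f"copy: {t_copy:.6f}s  return: {t_ref:.6f}s")
```

A microbenchmark like this only shows the per-call copy cost; the end-to-end test_models.py numbers below are the real evidence.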
-await self.storage_volume.get.call_one(
-    key, transport_buffer, request.meta_only()
-)
+transport_buffer = await self.storage_volume.get.call_one(
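The shape of this change can be sketched with hypothetical helpers (`FakeBuffer`, `get_into`, `get`, and the in-memory `store` are illustrative, not names from the repo): instead of the caller passing a pre-allocated buffer for the callee to fill, the callee constructs and returns the buffer itself.

```python
import asyncio

class FakeBuffer:
    def __init__(self, data=None):
        self.data = data

async def get_into(key, buf, store):
    # old shape: caller allocates `buf`, callee copies into it
    buf.data = bytes(store[key])

async def get(key, store):
    # new shape: callee builds and returns the buffer, one less copy
    return FakeBuffer(store[key])

async def main():
    store = {"w": b"weights"}
    buf = FakeBuffer()
    await get_into("w", buf, store)
    returned = await get("w", store)
    return buf.data, returned.data

filled, returned = asyncio.run(main())
print(filled, returned)
```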
You're creating a race condition here -- memory is often created on the fly in storage volume to deal with non-contiguous tensors.
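The lifetime hazard being described can be illustrated with a small sketch (assuming CPython refcounting; `Staging` and both getters are hypothetical, not repo code): if the storage volume stages a contiguous copy on the fly and hands back only a lightweight handle, the staging memory can be freed before anyone reads through the handle.

```python
import weakref

class Staging:
    """Stands in for an on-the-fly contiguous staging buffer."""
    def __init__(self, data):
        self.data = data

def get_handle_only(noncontig):
    staging = Staging(bytes(noncontig))  # contiguous copy made on the fly
    return weakref.ref(staging)          # RDMABuffer-like handle, no ownership

def get_owning(noncontig):
    staging = Staging(bytes(noncontig))
    return staging                       # returned object keeps memory alive

view = memoryview(b"abcdef")[::2]        # a non-contiguous source
dangling = get_handle_only(view)         # staging dies at function return
owned = get_owning(view)
print(dangling() is None, owned.data)
```

The returned-buffer style is only safe if whatever is returned actually owns (or pins) the staged memory for the duration of the transfer.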
In storage volume, all the tensors are already contiguous, and it's just handing out RDMABuffers pointing to those tensors.
Codecov Report
❌ Patch coverage is

@@ Coverage Diff @@
##             main     #57   +/-   ##
=======================================
  Coverage        ?   61.01%
=======================================
  Files           ?       22
  Lines           ?     1698
  Branches        ?        0
=======================================
  Hits            ?     1036
  Misses          ?      662
  Partials        ?        0

View full report in Codecov by Sentry.
I think TransportBuffer shouldn't need to allocate anything and shouldn't own anything. It's easier to reason about it if we simply treat it as a "remote reference" to a tensor of sorts.
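That "remote reference" framing could look something like the following sketch (every name here is illustrative, not the repo's actual API): the reference owns and allocates nothing, and only describes where a tensor lives so a backend can fetch it into memory the reader owns.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorRef:
    volume_id: str  # which storage volume holds the tensor
    key: str        # tensor key within that volume
    nbytes: int     # size, so the reader can allocate its own destination

def read_into(ref, dst, volumes):
    # the reader owns `dst`; the ref is just an address, nothing to free
    dst[: ref.nbytes] = volumes[ref.volume_id][ref.key]
    return dst

volumes = {"v0": {"w": b"\x01\x02\x03\x04"}}
ref = TensorRef("v0", "w", 4)
out = read_into(ref, bytearray(4), volumes)
print(bytes(out))
```

Because `TensorRef` is immutable and owns no memory, there is no allocation to race on; lifetime questions move to the volume that holds the tensor.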
Will need more evidence, but see below for results with test_models.py.
I agree; for example, with gloo, we can probably make non-contiguous tensors work without extra allocation. I think we can always change the interface later to add something less restrictive than
Yes, on slurm single node. Put went from 7 seconds to 4.3 seconds. I will run 32B in forge to double check.
after: https://www.internalfb.com/phabricator/paste/view/P1990993018
From forge e2e run on slurm 32b multi-node:
with patch:
ptal @LucasLLC
No description provided.